{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# About ChnSentiCorp_htl_all\n",
    "0. **Download:** [GitHub](https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/ChnSentiCorp_htl_all/ChnSentiCorp_htl_all.csv)\n",
    "1. **Overview:** more than 7,000 hotel reviews, of which more than 5,000 are positive and more than 2,000 are negative\n",
    "2. **Suggested experiments:** sentiment / opinion / review polarity analysis\n",
    "3. **Data source:** [Ctrip (携程网)](http://www.ctrip.com/)\n",
    "4. **Original dataset:** ChnSentiCorp_htl, compiled by [Tan Songbo (谭松波)](http://people.ucas.ac.cn/~0012244)\n",
    "5. **Processing:**\n",
    "    1. Merged the original 10,000 separate files into a single file\n",
    "    2. Changed the label of negative reviews from -1 to 0\n",
    "    3. Removed duplicates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
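  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Placeholder: directory that contains ChnSentiCorp_htl_all.csv (keep the trailing separator).\n",
    "path = 'path_to_ChnSentiCorp_htl_all_folder/'"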
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. ChnSentiCorp_htl_all.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of reviews (total): 7766\n",
      "Number of reviews (positive): 5322\n",
      "Number of reviews (negative): 2444\n"
     ]
    }
   ],
   "source": [
    "pd_all = pd.read_csv(path + 'ChnSentiCorp_htl_all.csv')\n",
    "\n",
    "print('Number of reviews (total): %d' % pd_all.shape[0])\n",
    "print('Number of reviews (positive): %d' % pd_all[pd_all.label==1].shape[0])\n",
    "print('Number of reviews (negative): %d' % pd_all[pd_all.label==0].shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fields\n",
    "\n",
    "| Field | Description |\n",
    "| ---- | ---- |\n",
    "| label | 1 for a positive review, 0 for a negative review |\n",
    "| review | Review text |"
   ]
  },
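  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The checks below are a minimal sketch and not part of the original notebook. They assume `pd_all` was loaded by the cell above with the `label` and `review` columns listed in the table, and they use only standard pandas calls to look for missing reviews, exact duplicate rows, and the share of each label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sanity checks (illustrative sketch; assumes pd_all from the loading cell).\n",
    "print('Missing reviews: %d' % pd_all.review.isnull().sum())\n",
    "print('Exact duplicate rows: %d' % pd_all.duplicated().sum())\n",
    "\n",
    "# Share of each label; normalize=True returns fractions instead of counts.\n",
    "print(pd_all.label.value_counts(normalize=True))"
   ]
  },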
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5612</th>\n",
       "      <td>0</td>\n",
       "      <td>房间小得无法想象,建议个子大的不要选择,一般的睡觉脚也伸不直.房间不超过10平方,彩电是14...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7321</th>\n",
       "      <td>0</td>\n",
       "      <td>我们一家人带孩子去过“五.一”，在协程网上挑了半天才选中的酒店，但看来还是错了。1.酒店除了...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3870</th>\n",
       "      <td>1</td>\n",
       "      <td>周六到西山去采橘子,路过这家酒店的时候就觉得应该不错的,采好橘子回来天也晚了,就临时决定住在...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4057</th>\n",
       "      <td>1</td>\n",
       "      <td>交通很便利,到渔人码头和港澳码头都在步行的范围之内.CHECKIN和CHECKOUT的速度都...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1452</th>\n",
       "      <td>1</td>\n",
       "      <td>很不错的一个酒店,床很大,很舒服.酒店员工的服务态度很亲切.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4805</th>\n",
       "      <td>1</td>\n",
       "      <td>酒店环境和服务都还不错，地理位置也不错，尤其是酒店北面的川北凉粉确实好吃，不过就是隔音效果不...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6868</th>\n",
       "      <td>0</td>\n",
       "      <td>旧楼改建的酒店，期望不要太高。酒店经理的态度很好，会帮助解决问题。有一位前台小姐的态度实在是...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1345</th>\n",
       "      <td>1</td>\n",
       "      <td>经常去海口出差,但从没住过该酒店.看外表感觉一般吧其实酒店里面还真不错,房间是新装修的(我住...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2026</th>\n",
       "      <td>1</td>\n",
       "      <td>算是海口市比较好的酒店了。处于市中心，购物方便。服务态度好。保险柜出问题了叫人来开，打个电话...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2805</th>\n",
       "      <td>1</td>\n",
       "      <td>感受的是热情的服务！从入门开始，一直很愉快！房间硬件只是准2星的吧，卫生间淋浴头在马桶上方，...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2915</th>\n",
       "      <td>1</td>\n",
       "      <td>房间很整洁，尤其是床上的哪个靠枕是我以前所住过宾馆没有的，红色的很喜庆。虽然是在当地比较繁华...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1803</th>\n",
       "      <td>1</td>\n",
       "      <td>准确的说，酒店的环境很漂亮，房间设施也还行，可以算4星标准。但是，卫生间下水道的气味实在是让...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4729</th>\n",
       "      <td>1</td>\n",
       "      <td>价格越来越高了,周遍不方便,去哪里都需要打车.不过装修风格很时尚舒适.服务态度不错.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1913</th>\n",
       "      <td>1</td>\n",
       "      <td>地理位置不错。但好像人气不太旺。不过下次也会考虑住这的。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7159</th>\n",
       "      <td>0</td>\n",
       "      <td>设施老化，紧靠马路噪音太大。晚上楼上卫生间的水流声和空调噪音非常大，无法入眠，跟总台反映后，...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1119</th>\n",
       "      <td>1</td>\n",
       "      <td>11月份住了一次。1.服务方面还不错，门童挺积极。2.感觉房间略有陈旧。3.早餐品种还算丰富...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2170</th>\n",
       "      <td>1</td>\n",
       "      <td>总的来说，酒店还不错。比较安静，地理位置比较好，服务也不错，包括入住和结账。不太好的地方，7...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2793</th>\n",
       "      <td>1</td>\n",
       "      <td>我喜欢那里,性价比很高地.去太原90%都住在那里的.服务员的服务很不错</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5895</th>\n",
       "      <td>0</td>\n",
       "      <td>非常糟糕！1。我们通过其商务中心包了一辆车游西湖，该车拉我们去不正规景点买茶叶（我们买了），...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4089</th>\n",
       "      <td>1</td>\n",
       "      <td>我是7月9号晚10点多的时候入住的，房间很新，据说是跟格林豪泰是同一公司的，可能是是新开业的...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      label                                             review\n",
       "5612      0  房间小得无法想象,建议个子大的不要选择,一般的睡觉脚也伸不直.房间不超过10平方,彩电是14...\n",
       "7321      0  我们一家人带孩子去过“五.一”，在协程网上挑了半天才选中的酒店，但看来还是错了。1.酒店除了...\n",
       "3870      1  周六到西山去采橘子,路过这家酒店的时候就觉得应该不错的,采好橘子回来天也晚了,就临时决定住在...\n",
       "4057      1  交通很便利,到渔人码头和港澳码头都在步行的范围之内.CHECKIN和CHECKOUT的速度都...\n",
       "1452      1                     很不错的一个酒店,床很大,很舒服.酒店员工的服务态度很亲切.\n",
       "4805      1  酒店环境和服务都还不错，地理位置也不错，尤其是酒店北面的川北凉粉确实好吃，不过就是隔音效果不...\n",
       "6868      0  旧楼改建的酒店，期望不要太高。酒店经理的态度很好，会帮助解决问题。有一位前台小姐的态度实在是...\n",
       "1345      1  经常去海口出差,但从没住过该酒店.看外表感觉一般吧其实酒店里面还真不错,房间是新装修的(我住...\n",
       "2026      1  算是海口市比较好的酒店了。处于市中心，购物方便。服务态度好。保险柜出问题了叫人来开，打个电话...\n",
       "2805      1  感受的是热情的服务！从入门开始，一直很愉快！房间硬件只是准2星的吧，卫生间淋浴头在马桶上方，...\n",
       "2915      1  房间很整洁，尤其是床上的哪个靠枕是我以前所住过宾馆没有的，红色的很喜庆。虽然是在当地比较繁华...\n",
       "1803      1  准确的说，酒店的环境很漂亮，房间设施也还行，可以算4星标准。但是，卫生间下水道的气味实在是让...\n",
       "4729      1         价格越来越高了,周遍不方便,去哪里都需要打车.不过装修风格很时尚舒适.服务态度不错.\n",
       "1913      1                       地理位置不错。但好像人气不太旺。不过下次也会考虑住这的。\n",
       "7159      0  设施老化，紧靠马路噪音太大。晚上楼上卫生间的水流声和空调噪音非常大，无法入眠，跟总台反映后，...\n",
       "1119      1  11月份住了一次。1.服务方面还不错，门童挺积极。2.感觉房间略有陈旧。3.早餐品种还算丰富...\n",
       "2170      1  总的来说，酒店还不错。比较安静，地理位置比较好，服务也不错，包括入住和结账。不太好的地方，7...\n",
       "2793      1                我喜欢那里,性价比很高地.去太原90%都住在那里的.服务员的服务很不错\n",
       "5895      0  非常糟糕！1。我们通过其商务中心包了一辆车游西湖，该车拉我们去不正规景点买茶叶（我们买了），...\n",
       "4089      1  我是7月9号晚10点多的时候入住的，房间很新，据说是跟格林豪泰是同一公司的，可能是是新开业的..."
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd_all.sample(20)"
   ]
  },
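  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Building balanced corpora\n",
    "\n",
    "- The original dataset also includes 3 balanced corpora: ChnSentiCorp_htl_ba_2000, ChnSentiCorp_htl_ba_4000, ChnSentiCorp_htl_ba_6000\n",
    "- Similar balanced corpora are easy to build by random sampling\n",
    "- When a class has fewer reviews than requested (for example, 3,000 negatives are needed for the 6,000-review corpus but only 2,444 exist), that class is sampled with replacement, so some reviews repeat"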
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split the full corpus by label.\n",
    "pd_positive = pd_all[pd_all.label==1]\n",
    "pd_negative = pd_all[pd_all.label==0]\n",
    "\n",
    "def get_balance_corpus(corpus_size, corpus_pos, corpus_neg):\n",
    "    # Draw corpus_size // 2 reviews per class; sample with replacement only\n",
    "    # when a class has fewer rows than requested.\n",
    "    sample_size = corpus_size // 2\n",
    "    pd_corpus_balance = pd.concat([corpus_pos.sample(sample_size, replace=corpus_pos.shape[0]<sample_size),\n",
    "                                   corpus_neg.sample(sample_size, replace=corpus_neg.shape[0]<sample_size)])\n",
    "\n",
    "    print('Number of reviews (total): %d' % pd_corpus_balance.shape[0])\n",
    "    print('Number of reviews (positive): %d' % pd_corpus_balance[pd_corpus_balance.label==1].shape[0])\n",
    "    print('Number of reviews (negative): %d' % pd_corpus_balance[pd_corpus_balance.label==0].shape[0])\n",
    "\n",
    "    return pd_corpus_balance"
   ]
  },
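  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The sampling above is unseeded, so every run produces a different corpus. The cell below is a hedged sketch of one way to make it reproducible. `get_balance_corpus_seeded` and its `seed` parameter are hypothetical names that are not part of the original notebook; the only addition is the `random_state` argument of `DataFrame.sample`. The last lines count rows that were drawn more than once, which can happen when a class is sampled with replacement."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_balance_corpus_seeded(corpus_size, corpus_pos, corpus_neg, seed=42):\n",
    "    # Hypothetical variant of get_balance_corpus: same logic plus a fixed seed.\n",
    "    sample_size = corpus_size // 2\n",
    "    parts = [\n",
    "        c.sample(sample_size, replace=c.shape[0] < sample_size, random_state=seed)\n",
    "        for c in (corpus_pos, corpus_neg)\n",
    "    ]\n",
    "    # Shuffle so that positive and negative rows are interleaved.\n",
    "    return pd.concat(parts).sample(frac=1, random_state=seed)\n",
    "\n",
    "demo = get_balance_corpus_seeded(6000, pd_positive, pd_negative)\n",
    "# Rows drawn more than once keep the same index label from pd_all.\n",
    "print('Repeated rows: %d' % demo.index.duplicated().sum())"
   ]
  },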
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of reviews (total): 2000\n",
      "Number of reviews (positive): 1000\n",
      "Number of reviews (negative): 1000\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5536</th>\n",
       "      <td>0</td>\n",
       "      <td>建议携程不要和这家酒店合作,名曰三星,要我看准星级都勉强!首先不在市区里面(去涵江区打车还要...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4086</th>\n",
       "      <td>1</td>\n",
       "      <td>感觉比老街口客栈舒适，很中规中矩的3星级，推荐大家住主楼的豪华间，设施比较好，前台和大堂的服...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6112</th>\n",
       "      <td>0</td>\n",
       "      <td>是我遇到的最差的4星酒店，进门没人管，进去要我和大堂打招呼，退房也很慢，不会再去住了</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4440</th>\n",
       "      <td>1</td>\n",
       "      <td>房间的设施不错，由于武夷山市是个小地方，酒店离景区有一定距离，如果没有自己开车就不太方便，但...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2706</th>\n",
       "      <td>1</td>\n",
       "      <td>首次入住该酒店,环境雅致,服务非常不错,很多笑脸,感觉热情,早餐可以接受,有送餐服务以后去徐...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1770</th>\n",
       "      <td>1</td>\n",
       "      <td>不错!就是洗澡的地方小点~~下回去还住这家~~</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4306</th>\n",
       "      <td>1</td>\n",
       "      <td>环境位置很好,房间情况尚可,早餐一般般,价格偏高了一些.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2161</th>\n",
       "      <td>1</td>\n",
       "      <td>位置优越，出行方便。就是房间较小，床位较小，房间装修较旧，其他方面都不错。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7667</th>\n",
       "      <td>0</td>\n",
       "      <td>酒店周围环境差，内部也很旧，卫生不好，很脏，总之没什么好的，下次决不住这。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4419</th>\n",
       "      <td>1</td>\n",
       "      <td>我7月24号入住瑞豪酒店，开始有些不顺利，但是那里的管理还是非常好的，有位姓赵的经理发现问题...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      label                                             review\n",
       "5536      0  建议携程不要和这家酒店合作,名曰三星,要我看准星级都勉强!首先不在市区里面(去涵江区打车还要...\n",
       "4086      1  感觉比老街口客栈舒适，很中规中矩的3星级，推荐大家住主楼的豪华间，设施比较好，前台和大堂的服...\n",
       "6112      0         是我遇到的最差的4星酒店，进门没人管，进去要我和大堂打招呼，退房也很慢，不会再去住了\n",
       "4440      1  房间的设施不错，由于武夷山市是个小地方，酒店离景区有一定距离，如果没有自己开车就不太方便，但...\n",
       "2706      1  首次入住该酒店,环境雅致,服务非常不错,很多笑脸,感觉热情,早餐可以接受,有送餐服务以后去徐...\n",
       "1770      1                            不错!就是洗澡的地方小点~~下回去还住这家~~\n",
       "4306      1                       环境位置很好,房间情况尚可,早餐一般般,价格偏高了一些.\n",
       "2161      1              位置优越，出行方便。就是房间较小，床位较小，房间装修较旧，其他方面都不错。\n",
       "7667      0              酒店周围环境差，内部也很旧，卫生不好，很脏，总之没什么好的，下次决不住这。\n",
       "4419      1  我7月24号入住瑞豪酒店，开始有些不顺利，但是那里的管理还是非常好的，有位姓赵的经理发现问题..."
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ChnSentiCorp_htl_ba_2000 = get_balance_corpus(2000, pd_positive, pd_negative)\n",
    "\n",
    "ChnSentiCorp_htl_ba_2000.sample(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of reviews (total): 4000\n",
      "Number of reviews (positive): 2000\n",
      "Number of reviews (negative): 2000\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3605</th>\n",
       "      <td>1</td>\n",
       "      <td>酒店就在海水浴场旁边，出门到接触到海水两分钟，如果要和海水亲近的朋友，极力推荐。这样游泳换衣...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7260</th>\n",
       "      <td>0</td>\n",
       "      <td>TheWorsehotelinChengdurightnow,checkoutat12.30...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5762</th>\n",
       "      <td>0</td>\n",
       "      <td>房间还算可以，不过前台服务人员的态度，受不了，我晚上11点多到酒店CHEKIN第二天退房的时...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5790</th>\n",
       "      <td>0</td>\n",
       "      <td>酒店设施陈旧，浴缸排水不畅，入住无房，一间16：00，一间22：00，早餐差</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4504</th>\n",
       "      <td>1</td>\n",
       "      <td>虽是公寓式酒店，但其房间整洁程度、全方位的服务都给我留下了很好的印象。丝丝不完善之处在于很多...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5246</th>\n",
       "      <td>1</td>\n",
       "      <td>很好的酒店，很喜欢，房间很干净很漂亮，从房间的窗口看出去，超美的，在市中心区域，出行也非常的...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>624</th>\n",
       "      <td>1</td>\n",
       "      <td>在临沂，这个酒店算是比较有档次的了，给外国客人的服务也比较合格。可惜电视内容比较单调，国外的...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1382</th>\n",
       "      <td>1</td>\n",
       "      <td>4年前住过，我和德国同事都觉得很不错。今年我又选了豪门，还是觉得很好。自助餐品种丰富，房间宽...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3723</th>\n",
       "      <td>1</td>\n",
       "      <td>价格不高,比较实惠,服务也不错,离闹市区不远.交通也比较方便.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3328</th>\n",
       "      <td>1</td>\n",
       "      <td>房间：建筑风格比较独特。木屋矗立在随潮汐涨落的水中，围廊象迷宫一样。看着自己的小屋，却没有直...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      label                                             review\n",
       "3605      1  酒店就在海水浴场旁边，出门到接触到海水两分钟，如果要和海水亲近的朋友，极力推荐。这样游泳换衣...\n",
       "7260      0  TheWorsehotelinChengdurightnow,checkoutat12.30...\n",
       "5762      0  房间还算可以，不过前台服务人员的态度，受不了，我晚上11点多到酒店CHEKIN第二天退房的时...\n",
       "5790      0             酒店设施陈旧，浴缸排水不畅，入住无房，一间16：00，一间22：00，早餐差\n",
       "4504      1  虽是公寓式酒店，但其房间整洁程度、全方位的服务都给我留下了很好的印象。丝丝不完善之处在于很多...\n",
       "5246      1  很好的酒店，很喜欢，房间很干净很漂亮，从房间的窗口看出去，超美的，在市中心区域，出行也非常的...\n",
       "624       1  在临沂，这个酒店算是比较有档次的了，给外国客人的服务也比较合格。可惜电视内容比较单调，国外的...\n",
       "1382      1  4年前住过，我和德国同事都觉得很不错。今年我又选了豪门，还是觉得很好。自助餐品种丰富，房间宽...\n",
       "3723      1                    价格不高,比较实惠,服务也不错,离闹市区不远.交通也比较方便.\n",
       "3328      1  房间：建筑风格比较独特。木屋矗立在随潮汐涨落的水中，围廊象迷宫一样。看着自己的小屋，却没有直..."
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ChnSentiCorp_htl_ba_4000 = get_balance_corpus(4000, pd_positive, pd_negative)\n",
    "\n",
    "ChnSentiCorp_htl_ba_4000.sample(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of reviews (total): 6000\n",
      "Number of reviews (positive): 3000\n",
      "Number of reviews (negative): 3000\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>4817</th>\n",
       "      <td>1</td>\n",
       "      <td>入住的是260元的迷你标准间。感觉比想象的要好很多，房间如果住一个人很合适的，洗手间很大，很...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7021</th>\n",
       "      <td>0</td>\n",
       "      <td>7点到了酒店前台打电话问了楼层说房间可以入住，上楼竟然房间的垃圾成堆根本就没有打扫，下楼要求...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6484</th>\n",
       "      <td>0</td>\n",
       "      <td>又要对他进行点评了，呜呜。。。说什么好呢</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6715</th>\n",
       "      <td>0</td>\n",
       "      <td>看了前面介绍的推荐去入住的，结果很失望，酒店的淋浴居然没有维护设施，洗个澡弄得整个洗手间都淋...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6775</th>\n",
       "      <td>0</td>\n",
       "      <td>酒店的设施太差了，估计连1星级都没有，房间空调都不开的，简直就是一塌糊涂。建议大家不要去预订该酒店</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7575</th>\n",
       "      <td>0</td>\n",
       "      <td>真的差得没话说，但说起来又有一堆。住进去的时候发现没有浴巾，第二天却一直打电话说我们拿了那两...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1615</th>\n",
       "      <td>1</td>\n",
       "      <td>酒店非常好，距离高速出口很近，服务也很到位，值得推荐的酒店，到泰山应该是最好的酒店了．</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6466</th>\n",
       "      <td>0</td>\n",
       "      <td>携城预定员极力推荐这家酒店，相信她才入住了这家，结果到了酒店才发现，连一星级都不如，前台的小...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1392</th>\n",
       "      <td>1</td>\n",
       "      <td>酒店很大，服务太差，Ａ楼房间也老，下次再也不住了。环境很好，打高尔夫的或许可以忍忍吧。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4408</th>\n",
       "      <td>1</td>\n",
       "      <td>房间很大，大的让我去其他宾馆都感觉性价比不高！服务也不错，值得一住！！</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      label                                             review\n",
       "4817      1  入住的是260元的迷你标准间。感觉比想象的要好很多，房间如果住一个人很合适的，洗手间很大，很...\n",
       "7021      0  7点到了酒店前台打电话问了楼层说房间可以入住，上楼竟然房间的垃圾成堆根本就没有打扫，下楼要求...\n",
       "6484      0                               又要对他进行点评了，呜呜。。。说什么好呢\n",
       "6715      0  看了前面介绍的推荐去入住的，结果很失望，酒店的淋浴居然没有维护设施，洗个澡弄得整个洗手间都淋...\n",
       "6775      0  酒店的设施太差了，估计连1星级都没有，房间空调都不开的，简直就是一塌糊涂。建议大家不要去预订该酒店\n",
       "7575      0  真的差得没话说，但说起来又有一堆。住进去的时候发现没有浴巾，第二天却一直打电话说我们拿了那两...\n",
       "1615      1        酒店非常好，距离高速出口很近，服务也很到位，值得推荐的酒店，到泰山应该是最好的酒店了．\n",
       "6466      0  携城预定员极力推荐这家酒店，相信她才入住了这家，结果到了酒店才发现，连一星级都不如，前台的小...\n",
       "1392      1        酒店很大，服务太差，Ａ楼房间也老，下次再也不住了。环境很好，打高尔夫的或许可以忍忍吧。\n",
       "4408      1                房间很大，大的让我去其他宾馆都感觉性价比不高！服务也不错，值得一住！！"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ChnSentiCorp_htl_ba_6000 = get_balance_corpus(6000, pd_positive, pd_negative)\n",
    "\n",
    "ChnSentiCorp_htl_ba_6000.sample(10)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
